Modeling of ionic liquids viscosity via advanced white-box machine learning

Ionic liquids (ILs) are more widely used within the industry than ever before, and accurate models of their physicochemical characteristics are becoming increasingly important during the process optimization. It is especially challenging to simulate the viscosity of ILs since there is no widely agreed explanation of how viscosity is determined in liquids. In this research, genetic programming (GP) and group method of data handling (GMDH) models were used as white-box machine learning approaches to predict the viscosity of pure ILs. These methods were developed based on a large open literature database of 2813 experimental viscosity values from 45 various ILs at different pressures (0.06–298.9 MPa) and temperatures (253.15–573 K). The models were developed based on five, six, and seven inputs, and it was found that all the models with seven inputs provided more accurate results, while the models with five and six inputs had acceptable accuracy and simpler formulas. Based on GMDH and GP proposed approaches, the suggested GMDH model with seven inputs gave the most exact results with an average absolute relative deviation (AARD) of 8.14% and a coefficient of determination (R2) of 0.98. The proposed techniques were compared with theoretical and empirical models available in the literature, and it was displayed that the GMDH model with seven inputs strongly outperforms the existing approaches. The leverage statistical analysis revealed that most of the experimental data were located within the applicability domains of both GMDH and GP models and were of high quality. Trend analysis also illustrated that the GMDH and GP models could follow the expected trends of viscosity with variations in pressure and temperature. In addition, the relevancy factor portrayed that the temperature had the greatest impact on the ILs viscosity. The findings of this study illustrated that the proposed models represented strong alternatives to time-consuming and costly experimental methods of ILs viscosity measurement.

www.nature.com/scientificreports/ the analytical data, fortifying the conclusions drawn.Strict standards for experimental data from open literature sources significantly contribute to the reliability of the results, highlighting the commitment to data dependability.
The dataset was randomly split into training (80%) and test (20%) subsets, ensuring that the test set remains undisclosed during parameter adjustments for independence.The application of k-fold cross-validation to the training subset played a pivotal role in this investigation.This approach ensures that each observation in the dataset is included in both the training and validation sets.The deliberate use of 6 k-folds for all models was strategic, with the choice depending on the data size-striking a balance between avoiding excessive or insufficient fold sizes.The train data underwent random partitioning into 6 folds, where the model fitting involved K-1 folds (i.e., 5 folds), and validation was conducted using the remaining fold.

Model development
Using Eyring's Theory (ET) to calculate Pure Viscosity Kirkwood et al. have come up with a strong theory regarding how monatomic liquids transport 49 .This idea itself, however, does not provide immediate results.The absolute rate idea was suggested by Eyring et al. 50,51 .The individual molecules are always moving in a pure liquid at rest.But because the molecules are closely packed inside a "cage," the motion is mostly limited to the vibrations that each molecule generates in response to its nearest neighbors.The height-energy barrier G + 0 N A is this "cage" where N A stands for the Avogadro number (molecules/g-mol).Additionally, in order to "escape" from the stationary fluid cage, G + 0ˆ, or a molar-free activation energy, is required.(Fig. 1) 25,51 .Following Eyring's theory (ET), a molecule escapes its "cage" into a resting liquid's "hole" 25 .As a result, every molecule moves in the length of " α " at a frequency " f ".The frequency is set by the rate expression: where K , P , and R are stand for the Boltzmann (J/K), the Planck constant, and the gas constant (J/mole•k), respectively.T and G + 0 represent the molar activation energy and absolute temperature (K) of the fluid at rest.Additionally, a fluid traveling in the x-direction with a gradient of velocity dV x dy experiences molecular reconfigurations more often.The potential energy barrier, deformed by the applied stress τ yx is seen in Fig. 1 and will be expressed using the subsequent equation: where an estimate of how much work was performed on the molecules is shown by ±(γ / α) . This is the mole liquid volume denoted by Q .Equations ( 2) and (3) are then merged as follows: (2) Illustration of a liquid flow's escape mechanism.Molecule 1 has to go through a "bottleneck" in order to get to the vacant position. 25.
Vol:.(1234567890) Scientific Reports | (2024) 14:8666 | https://doi.org/10.1038/s41598-024-55147-wwww.nature.com/scientificreports/ The net velocity (Fig. 1) shows the separation between molecules in layer "A" and layer "B."The computation involves multiplying the net frequency of advancing jumps ( f + − f − ) by the distance travel in each jump ( α ).The frequency of forward and backward leaps are denotedby " f + " and " f − ".The following equation is used: Over a fairly small distance " α " between the two layers, a linear velocity profile may be observed, allowing: To sum up, Eqs. ( 4) and ( 6) are combined to form the following equation: 2 αTR ≪ 1 , the Taylor series can also be applied.Finally, the viscosity is derived using the following equation: The unity factor, γ α , makes the equation without compromising accuracy, since G + 0 is acquired empirically to ensure that the equation's values match the experimental results.However, it is demonstrated that, for a given fluid, the estimated G + 0 (free activation energies) are almost constant when fitting Eq. ( 8) to experimental viscosity values.This translates to the boiling point internal energy of vaporization U vap = H vap − RT Z vap , which is given by Eq. ( 9) as follows 63 : By using this empiricism and setting α γ = 1 , Eq. ( 8) becomes as follows when empiricism is set at α γ = 1: The following is an accurate estimate of the vaporization energy provided by the Trouton's rule at the typical boiling point: Equation (10), when approximated, reads as follows: where η indicates the expected viscosity (mPa•s) of pure ILs.N A and p , respectively, are the Avogadro number (mole −1 ) and the Plank constant (J•s).The Q represents the volume of a mole of liquid (m 3 mole −1 ), T b and T stands for the boiling temperature (K) and temperature (K), respectively.To promote the performance of Eq. ( 12), a " " term was added to Eq. (12) in Excel program for each IL in this study.This term is not constant; rather, it varies depending on ionic liquid.Empiricism η = Aexp(B/T) is compatible with eqs.(10) and ( 12) and appears to be a popular and useful approach.Viscosity decreases with temperature, according to the theory.

Group method of data handling (GMDH)
Ivakhnenko's data-management approach for groups matches Darwin's natural choice concept 52 .By merging two independent variables, the system chooses the optimal polynomial terms.The approach generates a generic multinomial term at each stage.The vast relationship multinomial Volterra-Kolmogorov-Gabor (VKG) analyzes the entire network 52 : In the above equation, the count of independent variables in the experiment is denoted by N v .From a set of measured data with N data points, a matrix can be generated.The measured results − → V y = y 1 , y 2 , . . ., y n are represented on the left-hand side of the matrix, while the independent variables − → V n = (x 1 , x 2 , . . ., x n ) are repre- sented on the right-hand side of the matrix.Both sides of the matrix are produced from the same set of data.
When two independent variables are coupled, a quadratic polynomial N v 2 can be used to estimate the actual data.Using N v parameters, here is a formula for N v 2 : The matrix of independent variables can here be built using the vector of new variables − → V z = (z 1 , z 2 , . . ., z n ) .To modify the parameters of equations, the least squares method is utilized (15).The objective is to maintain the square of the deviation from the actual data as small as possible in each column: In the above equation, N t denotes the count of datasets used.The measured data is used to construct training and testing subsets.The proportion of training and testing subsets is chosen at random.Equations are derived using the training set of data (15).The ideal set of parameters (z i ) .Variations from planned results must fulfill the following criteria, based on the predefined requirement: here, ε is an optional/random value.Just the z columns that meet the criteria are kept, whereas the ones that do not are deleted.The entire variation is preserved after each repetition and compared to the prior repetitions to check if the least variation has been achieved.

Genetic programming (GP)
GP is a breakthrough in optimization computing that combines traditional genetic methods with symbolic improvement [53][54][55] .It is predicated on an approach called "tree representation."This form is incredibly flexible since trees may represent full models of industrial systems, mathematical formulae, or computer programs.Creating model structures like differential equations, kinetic ordering, and steady-state models is best accomplished with this approach 56,57 .To achieve great variation, GP first creates an initial population, which consists of randomly selected individuals (trees).A new generation is finally formed by the software, which evaluates the individuals, selects individuals for reproduction, creates new individuals by mutation, crossover, and direct reproduction 57 .Unlike other optimization techniques, symbolic improvement uses the architectural arrangement of many symbols to convey workable solutions (that is, vectors of real values).

Statistical criteria
The models' validity was tested using the determination coefficient (R 2 ), standard deviation (SD), average absolute relative deviation (AARD%), average relative deviation percent (ARD%), and root mean square error (RMSE).Below are the statistical parameters: Determination Coefficient (R 2 ): R 2 is a regression coefficient that shows the model's accuracy.The model fits the data better if it is close to 1. R 2 s mathematical formula is as follows: Average Relative Deviation (ARD%): The relative deviation of the anticipated outcomes from the experimental data is determined using the ARD%: Positive and negative ARD (%) represents a model's underestimate and overestimate, respectively.Standard Deviation (SD): SD is a metric used to quantify the degree of dispersion of data around the central point.This has the following definition: Vol:.( 1234567890) The relative absolute deviation is used to quantify the difference between the actual or real data and the projected or represented data.It is shown by the equation that follows: Root Mean Square Error (RMSE): The RMSE is a frequently used statistical analysis approach for estimating the discrepancies between experimental and expected values.It goes by the name: When calculating the average IL viscosity using experimental/real data,the experimental/real viscosity (η) and the number of data points N P are represented by the variables "est", and "exp", respectively.

Graphical assessment of the models
Several graphical plots were used in this research to further evaluate the suggested models and measure their predicted performances.Among the visualization plots are diagrams showing the cumulative frequency and error distribution.In order to measure the distribution of error around the zero line and to indicate whether the model has a tendency to make mistakes, the percentage of relative deviation is displayed against target/real values in the error distribution.The cross-plot displays the estimated/represented value of the model in relation to the experimental data.After that, a slope line with a 45° unit is constructed to connect the experimental and represented/predicted values.A more accurate model is indicated by more data points that are shown along this line.The bulk of approximations will be inside a standard error range if the cumulative frequency is calculated from the absolute relative error.

Development of models
Using 2813 points of data (45 ionic liquids) collected from the literature, models were developed.Table 2's "Total" refers to the whole set of data (2813 data points) that were used for analysis and modeling in the current research.The database was split into training sets (which made up 80% of the overall dataset) and test sets (20% of the total dataset) at random.The 563 data points in the "testing" set were used to track over-fitting errors and the reliability of the built models.The "training" subset (2250 data sets) caused changes to the model's structure and tuning parameters.T, P, Mw, V c , T b , T c , P c and w were the input parameters, and IL viscosity was the output (Table 1).
To begin with, the GMDH method was used to build a new empirical correlation.The viscosity of ILs with 5, 6, and 7 inputs was found to be: 5 Inputs:     The critical temperature and pressure values of the IL are denoted Tc and Pc, respectively.There is also an acentric factor (w), temperature (T), and pressure (P), as well as IL molecular weight (Mw), critical volume (Vc), and IL boiling temperature (Tb).The other parameters are the adjustable correlation coefficients (Table 2).
The RMSE, SD, R 2 , and AARPE% for the proposed correlation are calculated for the GP and GMDH models in Table 2.
The cross-plots on the results of the experimental viscosity data and the predicted data for the given correlation are displayed in Fig. 2. Around the unit-slope line, this figure shows a medium-uniform distribution of forecasts.The viscosity of the ILs that were taken from the database was estimated using temperature (T) and boiling temperature (Tb), in accordance with Eyring's theory (Eq.13).AARD stands for 21.86%.The expected vs experimental IL viscosity is also plotted in a logarithmic cross-plot in Fig. 2. The data points were somewhat near the diagonal line, indicating moderate conformity.But data indicates that Arrhenius reliance does not match the experimental transport characteristics of ILs, which is why Eyring's theory does not hold up.In fact, ILs viscosity decreased as temperature rose, and this feature has to be changed by new model improvements.In order to define the thermal characteristics of ILs, the Vogel-Tamman-Fulcher (VTF) development is frequently used.This provides the basis for a complex energy landscape with several local potential energy minimums and a broad variety of energy barriers 58,59 .
The white-box machine learning models were carried out using GP and GMDH and compared to ET and Mousavi's model 25 ."White-box" models in machine learning are those that are easy for experts in the application area to understand.These models, in general, provide a fair mix between explainability and accuracy.The numerical assessment of the created methods is presented in Table 3.With the use of a GMDH optimizer, it was shown that the usage of seven inputs was the best design for forecasting the viscosity of ILs since it can anticipate the whole data collection with more accuracy than other approaches (AARD% = 8.14).

Graphical error analysis
A number of graphical error evaluations developed from the GP and GMDH models were examined in order to offer a more lucid evaluation of the models' efficacy.To evaluate the models, the predicted viscosity measurements were compared to the experimental measurements shown in Fig. 3(a-c).For the GP and GMDH models, there is a high formation of points around the unit slope in both the test and training datasets.The observed viscosity values are shown to be more accurate by the GMDH correlations than by the GP and Mousavi et al. correlations (Fig. 3c).As indicated in Fig. 3, the data distribution for GMDH correlation with 7 inputs is more on the slop line than the GMDH data with 5 and 6 inputs.The GMDH decreases overall relative deviation, resulting in the smallest error margin.
The AARE% of the white-box machine learning models is shown against the number of input parameters in Fig. 4. Comparing the GP model with experimental data indicated that it was less accurate, less flexible, and less well-suited than the GMDH model.Furthermore, the GMDH model with 7 inputs showed higher accuracy with experimental viscosity data in comparison to the GMDH trained on 5 and 6 inputs.
To show the models' level of competence, comparative graphs are used, such as cumulative frequency plots.The GMDH has the maximum cumulative frequency for a given absolute relative deviation, as seen in Fig. 5.To put it another way, the GMDH model predicted almost 70% of the data points as we got closer to the ARD of less than 4%, but the corresponding values for the GP models were 9%, 11%, and 10%, respectively.
Figure 6 presents a comparison of the created models with respect to their relative deviations.The model's ability to precisely predict the viscosity of ILs is demonstrated by the dense cluster of dots surrounding the zero line.As can be seen, the GMDH model with 5, 6, and 7 inputs estimated viscosity better than GP, Eyring theory, and Mousavi et al. correlations.
Figure 7 compares the acentric factor, molecular weight, boiling temperature, critical temperature / pressure, temperature, and pressure impacts (7 inputs) on AARD (%) for the GMDH and GP models under investigation.According to our findings, the GP model is more sensitive to changes, which leads to greater parameter values than the GMDH model, which was shown to be less susceptible.The GP model, for instance, is very temperaturesensitive (Fig. 7a).Thus, the GMDH model may be applied in a wide range of temperatures with a lower relative error of ARE <15%, whereas the GP model can only be utilized in a narrow range of temperatures (381-445 K) with a minimum relative error of 14.6%.
AARD values of 8.14% and 25.76% for the 7 inputs are displayed in Table 3 and are thus retained for future analyses since they are among the best responses for the GMDH and GP models.Based on the GMDH and GP correlations, Fig. 8 shows how temperature and pressure affect 1-ethyl-3-methylimidazolium hexafluorophosphate.The anticipated viscosity of ILs using both models is consistent with the experimental dataset, as expected.Viscosity assessments for ILs using GP correlations are, in turn, inconsistent, as seen in Fig. 8, and come with large error margins.As can be observed in Fig. 8b, there is a physical link between the temperature and the GMDH model; but, as the pressure increases, neither model can adequately represent the experimental data.

Identifying outliers in experimental data, GMDH, and GP models
Finding data that significantly differs from the bulk of the data in a database is the aim of outlier (or aberrant) identification 60,61 .Leverage is a well-known approach for doing this 60,62 .Standardized residuals (R) and the Hat matrix (H) are used 62 .The R value for each data point can be found using the below equation:  MSE stands for mean square error (MSE), while the i th data point's error and ii th Hat indices (Leverage) are represented by z i and H ii 63 .In addition, the following formula may be used to calculate Hat index (or Leverage): 64 here, X represents a two-dimensional q × w matrix (where "q" shows the number of data and "w" is the count of input variables).Also, X t is transpose of matrix.The outliers were investigated using the Williams plot after the R and H values were measured.In addition, the Leverage limit (H*), a parameter defined as 3a/b, where b stands for the count of data points and a is the number of model parameters plus one, is applied in this approach.
The calculated R values must be within [− 3, + 3] standard deviations in order to encompass 99.7% of the normally distributed data 17,62 .The model is statistically valid if a significant proportion of data points are in the range of H * ≥ H ≥ 0 and 3 ≥ R ≥ −3 17 .Since they are highly expected yet outside of the application domain, data points in the range of −3 ≤ R ≤ 3 and H * ≤ H are referred to as "Good High Leverage" points.Conversely, data points with R values larger than or less than -3 are referred to as "Bad High Leverage" data points.These regions are beyond the applicability range of the model and have significant levels of uncertainty.It is clear that reliable data significantly affect the GMDH (7 inputs) model's performance, making it the best model used in this study.The H* value, as per the suggested model, was 0.0085.The GMDH model's Williams plot is shown in Fig. 9 into the statistically significant range of 0 ≤ H ≤ 0.0085 and −3 ≤ R ≤ 3, all data points appear to fit into the established GMDH model.Less residual value normalization leads to an increase in reliability.However, (29) H = X X t X −1 X t www.nature.com/scientificreports/However, 24 suspicious data points, or fewer than 1% of the total data in Fig. 9, either R < − 3 or R > 3, making them outliers with considerable uncertainty.Furthermore, 77 data points, or 3% of all data had H > 0.0085.These points are all in the range of −3 ≤ R ≤ 3 , which indicates that they are all Good High Leverage regardless of their Hat (Leverage) values.

Variables' relative importance
When taking the GMDH model, all input variables were tested to see how much of an influence they had on the viscosity of ILs.The relative significance of the inputs with respect to one another is shown in Fig. 10.One measure used to evaluate each input parameter's impact on the pure viscosity of ILs as a model output is the where n represents the number of datasets.Also, the j-th value, and the mean of the I-th input are respectively represented by the variables,I i,j , and I i .Whereas η denotes the average value of the predicted ILs viscosity, while η j represnts the j-th value of the represented/expected viscosity.Based on the GMDH (as the output), Fig. 10 displays the relative effects of each parameter on the pure viscosity of ILs.It is demonstrated that temperature and the acentric factor significantly affect the model's output.Viscosity increases with an increase in pressure or acentric factor in pure ionic liquids.As Fig. 10 illustrates, increasing T, M w , V c , T b , T c , and P c parameters will result in a decrease in the viscosity of ILs, since they have negative relevance factors.Moreover, the temperature has the most significant effect on the viscosity of ILs compared to other inputs.
We compared our models to a nonlinear artificial neural network (ANN) model using a dataset of 8,523 IL-water mixture viscosity data points 16 .The assessment included critical performance indicators such as mean absolute error (MAE) and R-squared (R 2 ).The results show that the GP and GMDH models have equivalent, if not greater, prediction accuracy, with benefits in simplicity and interpretability.This comparative research not only supports our models' effectiveness but also highlights their potential as reliable methods for forecasting IL viscosity.Our target in this research was in line with the goal of obtaining better predictions about the physical features of ILs 15 .The main concern, though, is making accurate predictions regarding the viscosity of pure IL.We used the genetic programming (GP) and group method of data handling (GMDH) techniques to do this.In particular, our study adds custom models with clear benefits, focusing on accuracy and ease of use in determining the viscosity of pure ILs.

Conclusions
The GMDH model was obtained by modeling 2813 experimental findings from 45 ILs based on temperature, pressure, molecular weight, critical volume, and acentric factor.Furthermore, IL viscosity was calculated using temperature and boiling temperature in accordance with Eyring's hypothesis.There were statistical and graphical comparisons between GMDH and experimental data in order to evaluate the model's efficacy.AARD, ARD, RMSE, and R 2 parameters indicated that the GMDH model performed rather well.Using the relevance factor, the impact of input characteristics on the model's target parameter was also investigated.The relevance factor illustrated that the temperature is the most important parameter affecting ILs viscosity.Finally, the employed dataset's reliability and validity were assessed using the leverage statistics.In our case, Williams' plot was applied to study the established paradigm's applicability domain and data collection.Only a small number of data points were found to be outside the realm of applicability.In light of all the above, the developed GMDH model is able to accurately forecast IL viscosity and obtain IL physicochemical parameters in different chemical engineering processes.

Figure 7 .
Figure 7. AARE for the correlations between the GMDH and GP correlations with 7 inputs.Temperature; pressure; critical temperature; critical pressure; boiling temperature; acentric factor; and molecular weight are represented by (a-g).

Table 1 .
Dataset statistics acquired in this work.

Table 2 .
Calculated the statistical requirements for the developed correlations.
Vol:.(1234567890) Scientific Reports | (2024) 14:8666 | https://doi.org/10.1038/s41598-024-55147-wwww.nature.com/scientificreports/ To illustrate the error margin, many statistical metrics were computed for both created models with different inputs, including ARD, AARD, RMSE, SD, and R 2 .As more input data was provided, the AARD, ARD, SD, and RMSE values for the test and training datasets decreased, as seen in Table3for the GP and GMDH models. .The following is a breakdown of the models based on how accurate they are: Mousavi et al. correlation < Eyring theory < GP < GMDH.As a result, the GMDH model may produce more reliable estimates than the other established models.

Table 3 .
Statistical comparison between the Eyring modified model correlations and new developed correlation model with various inputs.